pycoQC API Usage
Running pycoQC in Jupyter notebook
If you want to run pycoQC interactively in Jupyter you need to install Jupyter manually. If you installed pycoQC in a virtual environment then install Jupyter in the same virtual environment.
pip3 install notebook
Launch the notebook in a shell terminal
jupyter notebook
If it does not auto-start, open the following URL in your favorite web browser: http://localhost:8888/tree
From the Jupyter homepage you can navigate to the directory you want to work in and create a new Python3 Notebook.
Then, to analyse your data follow the instructions given in the example usage notebook above.
Example files
The pycoQC repository contains several example sequencing summary files generated with various versions of Albacore and Guppy. Each of these files contains only 10,000 reads.
* docs/demo/data/summary/Albacore-1.2.1_basecall-1D-DNA_sequencing_summary.txt.gz
* docs/demo/data/summary/Albacore-1.2.3_basecall-1D-RNA_sequencing_summary.txt.gz
* docs/demo/data/summary/Albacore-1.7.0_basecall-1D-DNA_sequencing_summary.txt.gz
* docs/demo/data/summary/Albacore-2.1.10_basecall-1D-DNA_sequencing_summary.txt.gz
* docs/demo/data/summary/Albacore-2.1.10_basecall-1D-RNA_sequencing_summary.txt.gz
* docs/demo/data/summary/Albacore-2.3.1_basecall-1D-RNA_sequencing_summary.txt.gz
* docs/demo/data/summary/Guppy-2.1.3_basecall-1D-DNA_sequencing_summary.txt.gz
* docs/demo/data/summary/Guppy-2.1.3_basecall-1D-RNA_sequencing_summary.txt.gz
In addition to these summary files, Guppy now stores the barcode information in a separate barcoding summary file. There is one example in pycoQC:
* docs/demo/data/summary/Guppy-2.1.3_basecall-1D_DNA_barcoding_summary.txt.gz
Larger versions of some of these files are also available from https://www.ebi.ac.uk/~aleg/data/pycoQC_test/
- https://www.ebi.ac.uk/~aleg/data/pycoQC_test/Albacore-1.2.1_basecall-1D-DNA_sequencing_summary.txt.gz
- https://www.ebi.ac.uk/~aleg/data/pycoQC_test/Albacore-1.2.3_basecall-1D-RNA_sequencing_summary.txt.gz
- https://www.ebi.ac.uk/~aleg/data/pycoQC_test/Albacore-1.7.0_basecall-1D-DNA_sequencing_summary.txt.gz
- https://www.ebi.ac.uk/~aleg/data/pycoQC_test/Albacore-2.1.10_basecall-1D-DNA_sequencing_summary.txt.gz
- https://www.ebi.ac.uk/~aleg/data/pycoQC_test/Albacore-2.1.10_basecall-1D-RNA_sequencing_summary.txt.gz
- https://www.ebi.ac.uk/~aleg/data/pycoQC_test/Albacore-2.3.1_basecall-1D-RNA_sequencing_summary.txt.gz
General considerations
pycoQC is a simple class that is initialized with a text summary file generated by ONT Albacore or Guppy. For a 1D run, use the file named sequencing_summary.txt available at the root of the Albacore output directory. For 1D2, use sequencing_1dsq_summary.txt, which can be found in the 1dsq_analysis directory.
The instantiated object can subsequently be called with various methods that generate tables and plots.
There are a few different ways to get help for all the public package functions:
* In a separate window with the jupyter magic "?": ?pycoQC.channels_activity
* In an output cell with the standard help function: help (pycoQC.channels_activity)
* Inline with the cursor on the function of interest use shift + tab
Imports
For plotly offline plotting
Import pycoQC main class as well as Plotly and enable inline plotting in the current Notebook.
This is the recommended option. It ensures that all your data are stored inside the notebook.
The limitation is that generating many plots from large datasets makes the notebook quite heavy and slow.
# Run cell with Ctrl + Enter
# Import main pycoQC module
from pycoQC.pycoQC import pycoQC
# Import helper functions from pycoQC
from pycoQC.common import jhelp
# Import and setup plotly for offline plotting in Jupyter
from plotly.offline import plot, iplot, init_notebook_mode
init_notebook_mode (connected=False)
For plotly online plotting
This option takes advantage of the Plotly web-service for hosting graphs. It requires setting up an account (https://plot.ly/python/getting-started/#initialization-for-online-plotting) and providing credentials in the notebook. This could be a good option for easy sharing of the interactive plots generated by pycoQC.
# Only run this cell if you have set up a plotly account before and want to use the Plotly web-service
# from plotly.plotly import plot, iplot
# import plotly.tools as pt
# pt.set_credentials_file (username="XXXXXXXXXX", api_key="XXXXXXXXXX")
Initialisation
Upon initialization pycoQC reads the sequencing summary file, runs a series of tests and pre-processes the data for the plotting methods.
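The clean-up steps reported in the "Read counts" section of the output can be pictured with a small library-free sketch. This is not pycoQC's actual implementation: the function name clean_reads and the dictionary-based records are hypothetical, and treating a read as "pass" when its mean quality is greater than or equal to min_pass_qual is an assumption. Only the column names and the default threshold of 7 come from the documentation.

```python
# Hypothetical sketch of the clean-up steps; not pycoQC's real code.
def clean_reads(reads, min_pass_qual=7):
    """Drop reads with missing values or zero length, then count pass reads."""
    counts = {"initial": len(reads)}

    # Step 1: discard reads where any field is missing (None)
    complete = [r for r in reads if all(v is not None for v in r.values())]
    counts["na_discarded"] = len(reads) - len(complete)

    # Step 2: discard zero length reads
    valid = [r for r in complete if r["sequence_length_template"] > 0]
    counts["zero_length_discarded"] = len(complete) - len(valid)

    # Step 3: assumed rule, a read is 'pass' when its mean quality reaches the threshold
    counts["valid"] = len(valid)
    counts["valid_pass"] = sum(
        r["mean_qscore_template"] >= min_pass_qual for r in valid
    )
    return valid, counts


# Toy records standing in for lines of a sequencing summary file
reads = [
    {"read_id": "a", "sequence_length_template": 1200, "mean_qscore_template": 9.1},
    {"read_id": "b", "sequence_length_template": 0, "mean_qscore_template": 3.0},
    {"read_id": "c", "sequence_length_template": 800, "mean_qscore_template": 6.2},
    {"read_id": "d", "sequence_length_template": None, "mean_qscore_template": 8.0},
]
valid, counts = clean_reads(reads)
print(counts)
```

The counters mirror the lines printed by print (p): initial reads, reads discarded at each step, valid reads and valid pass reads.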
jhelp (pycoQC)
pycoQC.pycoQC.init
Parse Albacore sequencing_summary.txt file and clean-up the data
- seq_summary_file : str (required)
Path to the sequencing_summary generated by Albacore 1.0.0 + (read_fast5_basecaller.py) / Guppy 2.1.3+ (guppy_basecaller). One can also pass multiple space separated file paths or a UNIX style regex matching multiple files
- barcode_summary_file : str (default = None)
Path to the barcode_summary_file generated by Guppy 2.1.3+ (guppy_barcoder). This is not a required file. One can also pass multiple space separated file paths or a UNIX style regex matching multiple files
- runid_list : list of str (default = [])
Select only specific runids to be analysed. Can also be used to force pycoQC to order the runids for temporal plots, if the sequencing_summary file contain several sucessive runs. By default pycoQC analyses all the runids in the file and uses the runid order as defined in the file.
- filter_calibration : bool (default = False)
If True read flagged as calibration strand by the software are removed
- min_pass_qual : int (default = 7)
Minimum quality to consider a read as 'pass'
- min_barcode_percent : float (default = 0.1)
Minimal percent of total reads to retain barcode label. If below the barcode value is set as unclassified.
- verbose_level : int {0,1,2} (default = 0)
Level of verbosity, from 2 (Chatty) to 0 (Nothing)
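The seq_summary_file and barcode_summary_file options accept several space separated paths or a pattern matching multiple files. The sketch below shows one way such a specification could be expanded, assuming the pattern is a shell-style wildcard as in the "*RNA*" example used later in this notebook. The helper name expand_summary_paths is hypothetical and is not part of pycoQC.

```python
import glob
import os
import tempfile


def expand_summary_paths(spec):
    """Expand a space separated list of paths and/or wildcard patterns."""
    paths = []
    for pattern in spec.split():
        # glob.glob only returns paths that exist on disk
        paths.extend(sorted(glob.glob(pattern)))
    return paths


# Demo with throwaway empty files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    for name in ("A_DNA_summary.txt", "B_RNA_summary.txt", "C_RNA_summary.txt"):
        open(os.path.join(tmp, name), "w").close()
    found = expand_summary_paths(os.path.join(tmp, "*RNA*"))
    names = [os.path.basename(p) for p in found]
    print(names)
```

Splitting on whitespace means that paths containing spaces are not supported by this sketch.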
Basic initialisation
# Run cell with Ctrl + Enter
p = pycoQC("./data/summary/Albacore-1.7.0_basecall-1D-DNA_sequencing_summary.txt.gz")
print (p)
[pycoQC]
Runtime info
package_name: pycoQC
package_version: 2.2.3.7
timestamp: 2019-05-07 12:30:01.913131
verbose_level: 0
min_barcode_percent: 0.1
filter_calibration: False
min_pass_qual: 7
runid_list: []
barcode_summary_file: None
seq_summary_file: ./data/summary/Albacore-1.7.0_basecall-1D-DNA_sequencing_summary.txt.gz
Read counts
Initial reads: 10000
Reads with NA values discarded: 0
Zero length reads discarded: 438
Reads with low frequency barcode unset: 0
Valid reads: 9562
Valid pass reads: 8042
Initialisation with calibration strand filtering
# Run cell with Ctrl + Enter
p = pycoQC("./data/summary/Albacore-1.7.0_basecall-1D-DNA_sequencing_summary.txt.gz", verbose_level=1, filter_calibration=True)
print (p)
Importing raw data from sequencing summary files
Verifying fields and discarding unused columns
Droping lines containing NA values
Filtering out zero length reads
Filtering out calibration strand reads
Sorting run IDs by decreasing throughput
Reordering runids
Cleaning up low frequency barcodes
Reindexing dataframe by read_ids
[pycoQC]
Runtime info
package_name: pycoQC
package_version: 2.2.3.7
timestamp: 2019-05-07 12:30:02.131170
verbose_level: 1
min_barcode_percent: 0.1
filter_calibration: True
min_pass_qual: 7
runid_list: []
barcode_summary_file: None
seq_summary_file: ./data/summary/Albacore-1.7.0_basecall-1D-DNA_sequencing_summary.txt.gz
Read counts
Initial reads: 10000
Reads with NA values discarded: 0
Zero length reads discarded: 438
Calibration reads discarded: 2605
Reads with low frequency barcode unset: 0
Valid reads: 6957
Valid pass reads: 5594
Initialisation with summary file regex and maximum verbose level
# Run cell with Ctrl + Enter
p = pycoQC("./data/summary/*RNA*", verbose_level=2)
Importing raw data from sequencing summary files
Sequencing summary files found: ['./data/summary/Albacore-2.1.10_basecall-1D-RNA_sequencing_summary.txt.gz', './data/summary/Albacore-1.2.3_basecall-1D-RNA_sequencing_summary.txt.gz', './data/summary/Albacore-2.3.1_basecall-1D-RNA_sequencing_summary.txt.gz', './data/summary/Guppy-2.1.3_basecall-1D-RNA_sequencing_summary.txt.gz']
40,000 reads found in initial file
Verifying fields and discarding unused columns
1D Run type
Columns found: ['read_id', 'run_id', 'channel', 'start_time', 'sequence_length_template', 'mean_qscore_template']
Droping lines containing NA values
0 reads discarded
Filtering out zero length reads
813 reads discarded
Sorting run IDs by decreasing throughput
Run-id order ['7ae4f0a6d2b7ba3e0248496b7de9cd5d1c028415', '5074e0cd71f372314c30ca5158aab2172d915023', '9835d20f1d205bdbd1fb4d464ae778de95beab24', 'c675730269f2f96f300f1cfa613fe89c53b344c3', '2b9163100702bba6ac29d37dbc96ccad740aa05d', 'd0054681152930b21276405d948b115e46968ca6', '71055637dd56eca9416305332eba1ed37bbfffe1', 'db5916f2fe7957afac1d0aaccdec883342c4bc31', '93fa1ad3ebc8a6e505d991bcb052c2b8ceb278b5', '17b317b994031430f350cda1dc13a72f66572ece']
Reordering runids
Processing reads with Run_ID 7ae4f0a6d2b7ba3e0248496b7de9cd5d1c028415 / time offset: 0
Processing reads with Run_ID 5074e0cd71f372314c30ca5158aab2172d915023 / time offset: 5309.74734
Processing reads with Run_ID 9835d20f1d205bdbd1fb4d464ae778de95beab24 / time offset: 15911.26726
Processing reads with Run_ID c675730269f2f96f300f1cfa613fe89c53b344c3 / time offset: 183649.42351
Processing reads with Run_ID 2b9163100702bba6ac29d37dbc96ccad740aa05d / time offset: 184044.468
Processing reads with Run_ID d0054681152930b21276405d948b115e46968ca6 / time offset: 184432.95738
Processing reads with Run_ID 71055637dd56eca9416305332eba1ed37bbfffe1 / time offset: 184828.68148
Processing reads with Run_ID db5916f2fe7957afac1d0aaccdec883342c4bc31 / time offset: 229024.07989
Processing reads with Run_ID 93fa1ad3ebc8a6e505d991bcb052c2b8ceb278b5 / time offset: 401812.72839
Processing reads with Run_ID 17b317b994031430f350cda1dc13a72f66572ece / time offset: 402176.51159
Reindexing dataframe by read_ids
[pycoQC]
Runtime info
package_name: pycoQC
package_version: 2.2.3.7
timestamp: 2019-05-07 12:30:02.350680
verbose_level: 2
min_barcode_percent: 0.1
filter_calibration: False
min_pass_qual: 7
runid_list: []
barcode_summary_file: None
seq_summary_file: ./data/summary/*RNA*
Read counts
Initial reads: 40000
Reads with NA values discarded: 0
Zero length reads discarded: 813
Valid reads: 39187
Valid pass reads: 31661
Initialisation with guppy barcoding file
# Run cell with Ctrl + Enter
p = pycoQC(
    seq_summary_file="./data/summary/Guppy-2.1.3_basecall-1D-DNA_sequencing_summary.txt.gz",
    barcode_summary_file="./data/summary/Guppy-2.1.3_basecall-1D_DNA_barcoding_summary.txt.gz",
    verbose_level=1)
print(p)
Importing raw data from sequencing summary files
Importing barcode information from barcode summary files
Verifying fields and discarding unused columns
Droping lines containing NA values
Sorting run IDs by decreasing throughput
Reordering runids
Cleaning up low frequency barcodes
Reindexing dataframe by read_ids
[pycoQC]
Runtime info
package_name: pycoQC
package_version: 2.2.3.7
timestamp: 2019-05-07 12:30:03.027233
verbose_level: 1
min_barcode_percent: 0.1
filter_calibration: False
min_pass_qual: 7
runid_list: []
barcode_summary_file: ./data/summary/Guppy-2.1.3_basecall-1D_DNA_barcoding_summary.txt.gz
seq_summary_file: ./data/summary/Guppy-2.1.3_basecall-1D-DNA_sequencing_summary.txt.gz
Read counts
Initial reads: 10000
Reads with barcodes: 9080
Reads with NA values discarded: 0
Reads with low frequency barcode unset: 0
Valid reads: 10000
Valid pass reads: 9065
Generating plots and tables
Interaction with Plotly library
Plots are generated with plotly for Python and return a plotly Figure object that can be used by users for:
* Further customization using the numerous methods attached to the Figure object
* Inline plotting in Jupyter Notebook using iplot (either from plotly.plotly or plotly.offline)
* Generating a separate HTML file with plot (either from plotly.plotly or plotly.offline)
* Exporting as a static image (https://plot.ly/python/static-image-export/), pdf (https://plot.ly/python/pdf-reports/) or various text formats.
In this notebook we will use the inline plotting option with the offline plotly library.
Users can also customize the figures online in a user friendly environment by clicking on "Edit in Chart Studio" in the upper right corner of each figure.

Similarly static pictures can be exported using the "Download plot as a png" button.

Common arguments
All the methods have the arguments width and height that can be used to customize the plotting area. In general we do not recommend modifying these values as they might disrupt the plot layout.
Most of the methods also have the argument sample. By default pycoQC downsamples the number of reads to 100,000 before plotting. This drastically reduces the processing time for large datasets and has a very limited impact on the plot aspect. The sampling is random but deterministic, meaning that you should always obtain the same results for the same dataset. The value can be changed to increase or decrease the number of reads. Alternatively, one can deactivate the behavior by specifying sample=False.
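"Random but deterministic" sampling can be illustrated with the standard library. The sketch below assumes a fixed seed is what makes the selection reproducible. The function name sample_reads and the seed value are hypothetical, and pycoQC may achieve the same effect differently.

```python
import random


def sample_reads(read_ids, sample=100000, seed=42):
    """Return at most `sample` read IDs, chosen reproducibly."""
    # sample=False (or None) disables downsampling, as does a dataset
    # that is already smaller than the requested sample size
    if not sample or sample >= len(read_ids):
        return list(read_ids)
    rng = random.Random(seed)  # fixed seed: same selection on every call
    return rng.sample(list(read_ids), sample)


ids = [f"read_{i}" for i in range(1000)]
first = sample_reads(ids, sample=100)
second = sample_reads(ids, sample=100)
print(len(first), first == second)
```

Because the generator is re-created with the same seed on each call, two calls on the same dataset select exactly the same reads.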
Data summary
The summary method generates a simple summary table with a clickable button to switch from "all reads" to "pass reads" only.
jhelp(pycoQC.summary)
pycoQC.pycoQC.summary
Plot an interactive summary table per runid
- groupby : str {run_id, barcode, None} (default = None)
Value of field to group the data in the table
- width : int (default = None)
With of the ploting area in pixel
- height : int (default = None)
height of the ploting area in pixel
- plot_title : str (default = Run summary)
Title to display on top of the plot
All reads summary
# Run cell with Ctrl + Enter
p = pycoQC("./data/summary/Albacore-2.1.10_basecall-1D-DNA_sequencing_summary.txt.gz")
fig = p.summary()
iplot (fig, show_link=False)
Summary per run_id
# Run cell with Ctrl + Enter
p = pycoQC("./data/summary/Albacore-2.1.10_basecall-1D-DNA_sequencing_summary.txt.gz")
fig = p.summary(groupby="run_id")
iplot (fig, show_link=False)
Summary per barcode
# Run cell with Ctrl + Enter
p = pycoQC("./data/summary/Albacore-1.7.0_basecall-1D-DNA_sequencing_summary.txt.gz")
fig = p.summary(groupby="barcode")
iplot (fig, show_link=False)
Read Length and Mean quality distribution
pycoQC has 3 methods to visualize the distribution of mean quality scores and of estimated read length:
* reads_len_1D: A histogram of estimated read length in logarithmic scale
* reads_qual_1D: A histogram of mean quality scores
* reads_len_qual_2D: A density contour plot of estimated read length vs mean quality scores in semilog scale
Although we recommend sticking to the default values, all 3 methods allow users to customize the plots.
* The number of bins used to divide the read quality and/or length space can be specified with nbins for the 1D plots and len_nbins / qual_nbins for the 2D plot
* The intensity of line smoothing (using a Gaussian kernel filter) can be specified with smooth_sigma
* Additional cosmetic customizations are available: color/colorscale
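To make the roles of nbins and smooth_sigma more concrete, here is a minimal pure Python sketch of a log-scale histogram followed by Gaussian kernel smoothing. It only illustrates the general technique. The function names log_histogram and gaussian_smooth, the kernel radius of 4 sigma and the edge handling are assumptions, not pycoQC's code.

```python
import math


def log_histogram(lengths, nbins=200):
    """Count reads in bins that are evenly spaced on a log10 scale."""
    logs = [math.log10(x) for x in lengths if x > 0]
    lo, hi = min(logs), max(logs)
    width = (hi - lo) / nbins or 1.0  # fall back to 1.0 if all lengths are equal
    counts = [0] * nbins
    for v in logs:
        idx = min(int((v - lo) / width), nbins - 1)  # the maximum goes in the last bin
        counts[idx] += 1
    return counts


def gaussian_smooth(values, sigma=2):
    """Smooth a sequence with a normalised Gaussian kernel.

    Indices falling outside the sequence reuse the nearest border value.
    """
    radius = int(4 * sigma)
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-0.5 * (k / sigma) ** 2) for k in offsets]
    total = sum(kernel)
    n = len(values)
    smoothed = []
    for i in range(n):
        acc = 0.0
        for k, w in zip(offsets, kernel):
            j = min(max(i + k, 0), n - 1)  # clamp the index at the borders
            acc += w * values[j]
        smoothed.append(acc / total)
    return smoothed


lengths = [150, 480, 520, 900, 1100, 1250, 3000, 8000, 12000, 45000]
counts = log_histogram(lengths, nbins=20)
curve = gaussian_smooth(counts, sigma=2)
print(sum(counts), len(curve))
```

A larger sigma spreads each bin count over more neighbours and gives a flatter curve, while more bins give a finer but noisier histogram before smoothing.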
jhelp(pycoQC.reads_len_1D)
pycoQC.pycoQC.reads_len_1D
Plot a distribution of read length (log scale)
- color : str (default = lightsteelblue)
Color of the area (hex, rgb, rgba, hsl, hsv or any CSV named colors https://www.w3.org/TR/css-color-3/#svg-color
- nbins : int (default = 200)
Number of bins to devide the x axis in
- smooth_sigma : float (default = 2)
standard deviation for Gaussian kernel
- sample : int (default = 100000)
If given, a n number of reads will be randomly selected instead of the entire dataset
- width (default = None)
With of the ploting area in pixel
- height : int (default = 500)
height of the ploting area in pixel
- plot_title : str (default = Distribution of read length)
Title to display on top of the plot
# Run cell with Ctrl + Enter
p = pycoQC("./data/summary/Guppy-2.1.3_basecall-1D-DNA_sequencing_summary.txt.gz")
fig = p.reads_len_1D()
iplot(fig, show_link=False)
jhelp(pycoQC.reads_qual_1D)
pycoQC.pycoQC.reads_qual_1D
Plot a distribution of quality scores
- color : str (default = salmon)
Color of the area (hex, rgb, rgba, hsl, hsv or any CSV named colors https://www.w3.org/TR/css-color-3/#svg-color
- nbins : int (default = 200)
Number of bins to devide the x axis in
- smooth_sigma : float (default = 2)
standard deviation for Gaussian kernel
- sample : int (default = 100000)
If given, a n number of reads will be randomly selected instead of the entire dataset
- width : int (default = None)
With of the ploting area in pixel
- height : int (default = 500)
height of the ploting area in pixel
- plot_title : str (default = Distribution of read quality scores)
Title to display on top of the plot
# Run cell with Ctrl + Enter
p = pycoQC("./data/summary/Albacore-2.1.10_basecall-1D-RNA_sequencing_summary.txt.gz")
fig = p.reads_qual_1D()
iplot(fig, show_link=False)
jhelp(pycoQC.reads_len_qual_2D)
pycoQC.pycoQC.reads_len_qual_2D
Plot a 2D distribution of quality scores vs length of the reads
- colorscale (default = [[0.0, 'rgba(255,255,255,0)'], [0.1, 'rgba(255,150,0,0)'], [0.25, 'rgb(255,100,0)'], [0.5, 'rgb(200,0,0)'], [0.75, 'rgb(120,0,0)'], [1.0, 'rgb(70,0,0)']])
a valid plotly color scale https://plot.ly/python/colorscales/ (Not recommanded to change)
- len_nbins : int (default = 200)
Number of bins to divide the read length values in (x axis)
- qual_nbins : int (default = 75)
Number of bins to divide the read quality values in (y axis)
- smooth_sigma : float (default = 2)
standard deviation for 2D Gaussian kernel
- sample : int (default = 100000)
If given, a n number of reads will be randomly selected instead of the entire dataset
- width : int (default = None)
With of the ploting area in pixel
- height : int (default = 600)
height of the ploting area in pixel
- plot_title : str (default = Mean read quality per sequence length)
Title to display on top of the plot
# Run cell with Ctrl + Enter
p = pycoQC("./data/summary/*Albacore*DNA*")
fig = p.reads_len_qual_2D ()
iplot(fig, show_link=False)
Sequencing output, quality and length over experiment time
pycoQC can generate plots showing the evolution of the sequencing output (output_over_time), the mean read quality (qual_over_time) and the read length (len_over_time) over the course of the sequencing run.
Please be aware that if there are multiple run IDs in the source file(s), pycoQC reorders the run IDs by decreasing throughput per second, as explained in Initialisation. This means that the over_time plots could be wrong, particularly when mixing several runs together. In that case the order can be imposed with the runid_list option.
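The reordering and the "time offset" values printed at verbose_level=2 can be pictured with the following sketch. It assumes that throughput is measured in bases per second and that each run is shifted by the cumulative duration of the runs placed before it. Both points are guesses based on the log messages, and the function name order_runs is hypothetical.

```python
def order_runs(reads, runid_list=None):
    """Return run IDs in plotting order plus the time offset applied to each.

    reads: list of (run_id, start_time, sequence_length) tuples.
    """
    stats = {}
    for run_id, start, length in reads:
        s = stats.setdefault(run_id, {"bases": 0, "end": 0.0})
        s["bases"] += length
        s["end"] = max(s["end"], start)  # latest read start seen for this run

    if runid_list:
        # Order imposed by the user; unknown run IDs are ignored
        order = [r for r in runid_list if r in stats]
    else:
        # Assumed criterion: bases per second, highest first
        order = sorted(
            stats,
            key=lambda r: stats[r]["bases"] / max(stats[r]["end"], 1.0),
            reverse=True,
        )

    offsets, elapsed = {}, 0.0
    for run_id in order:
        offsets[run_id] = elapsed  # shift this run after the previous ones
        elapsed += stats[run_id]["end"]
    return order, offsets


reads = [
    ("slow", 0, 100), ("slow", 100, 100),
    ("fast", 0, 1000), ("fast", 50, 1000),
]
print(order_runs(reads))
print(order_runs(reads, runid_list=["slow", "fast"]))
```

With the automatic ordering the "fast" run is placed first even if it was sequenced last, which is why imposing the chronological order matters for the over_time plots.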
For qual_over_time and len_over_time, the argument smooth_sigma can be used to adjust the smoothing factor of the Gaussian filter if you are not satisfied with the default result.
The colors of the plots can be fully customised:
* cumulative_color and interval_color for output_over_time
* median_color, quartile_color and extreme_color for qual_over_time and len_over_time
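The two series drawn by output_over_time (interval yield and cumulative yield) can be sketched as a simple binning of reads by start time. This is an illustration under stated assumptions: yield is counted in bases, start times are non-negative, and the function name yield_over_time is hypothetical.

```python
from itertools import accumulate


def yield_over_time(reads, time_bins=500):
    """Bin reads by start time and return (interval, cumulative) yields.

    reads: list of (start_time, sequence_length) tuples.
    """
    t_max = max(t for t, _ in reads)
    width = t_max / time_bins or 1.0  # fall back to 1.0 if all reads start at 0
    interval = [0] * time_bins
    for t, length in reads:
        idx = min(int(t / width), time_bins - 1)  # the last read goes in the last bin
        interval[idx] += length
    cumulative = list(accumulate(interval))  # running total of the interval yields
    return interval, cumulative


reads = [(0, 100), (10, 200), (25, 300), (40, 400)]
interval, cumulative = yield_over_time(reads, time_bins=4)
print(interval)
print(cumulative)
```

The interval series shows how productive each time slice was, and the cumulative series always ends at the total yield of the run.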
jhelp(pycoQC.output_over_time)
pycoQC.pycoQC.output_over_time
Plot a yield over time
- cumulative_color : str (default = rgb(204,226,255))
Color of cumulative yield area (hex, rgb, rgba, hsl, hsv or any CSV named colors https://www.w3.org/TR/css-color-3/#svg-color
- interval_color : str (default = rgb(102,168,255))
Color of interval yield line (hex, rgb, rgba, hsl, hsv or any CSV named colors https://www.w3.org/TR/css-color-3/#svg-color
- time_bins (default = 500)
Number of bins to divide the time values in (x axis)
- sample : int (default = 100000)
If given, a n number of reads will be randomly selected instead of the entire dataset
- width : int (default = None)
With of the ploting area in pixel
- height : int (default = 500)
height of the ploting area in pixel
- plot_title : str (default = Output over experiment time)
Title to display on top of the plot
# Run cell with Ctrl + Enter
p = pycoQC ("./data/summary/Albacore-1.2.1_basecall-1D-DNA_sequencing_summary.txt.gz")
fig = p.output_over_time ()
iplot(fig, show_link=False)
jhelp (pycoQC.qual_over_time)
pycoQC.pycoQC.qual_over_time
Plot a mean quality over time
- median_color : str (default = rgb(250,128,114))
Color of median line color (hex, rgb, rgba, hsl, hsv or any CSV named colors https://www.w3.org/TR/css-color-3/#svg-color
- quartile_color : str (default = rgb(250,170,160))
Color of inter quartile area and lines (hex, rgb, rgba, hsl, hsv or any CSV named colors https://www.w3.org/TR/css-color-3/#svg-color
- extreme_color : str (default = rgba(250,170,160,0.5))
Color of inter extreme area and lines (hex, rgb, rgba, hsl, hsv or any CSV named colors https://www.w3.org/TR/css-color-3/#svg-col
- smooth_sigma : float (default = 1)
sigma parameter for the Gaussian filter line smoothing
- time_bins (default = 500)
Number of bins to divide the time values in (x axis)
- sample : int (default = 100000)
If given, a n number of reads will be randomly selected instead of the entire dataset
- width : int (default = None)
With of the ploting area in pixel
- height : int (default = 500)
height of the ploting area in pixel
- plot_title : str (default = Read quality over experiment time)
Title to display on top of the plot
# Run cell with Ctrl + Enter
p = pycoQC ("./data/summary/Albacore-2.1.10_basecall-1D-DNA_sequencing_summary.txt.gz")
fig = p.qual_over_time ()
iplot(fig, show_link=False)
jhelp (pycoQC.len_over_time)
pycoQC.pycoQC.len_over_time
Plot a read length over time
- median_color : str (default = rgb(102,168,255))
Color of median line color (hex, rgb, rgba, hsl, hsv or any CSV named colors https://www.w3.org/TR/css-color-3/#svg-color
- quartile_color : str (default = rgb(153,197,255))
Color of inter quartile area and lines (hex, rgb, rgba, hsl, hsv or any CSV named colors https://www.w3.org/TR/css-color-3/#svg-color
- extreme_color : str (default = rgba(153,197,255,0.5))
Color of inter extreme area and lines (hex, rgb, rgba, hsl, hsv or any CSV named colors https://www.w3.org/TR/css-color-3/#svg-col
- smooth_sigma : float (default = 1)
sigma parameter for the Gaussian filter line smoothing
- time_bins (default = 500)
Number of bins to divide the time values in (x axis)
- sample : int (default = 100000)
If given, a n number of reads will be randomly selected instead of the entire dataset
- width : int (default = None)
With of the ploting area in pixel
- height : int (default = 500)
height of the ploting area in pixel
- plot_title : str (default = Read length over experiment time)
Title to display on top of the plot
# Run cell with Ctrl + Enter
p = pycoQC ("./data/summary/Albacore-2.1.10_basecall-1D-DNA_sequencing_summary.txt.gz")
fig = p.len_over_time ()
iplot(fig, show_link=False)
Barcode distribution
When barcoding information is available, it is possible to generate a pie chart of the barcode count distribution. If no barcode information is available pycoQC throws an error.
It is not rare to have non-relevant barcodes detected at a very low level. By default any barcode assigned to less than 0.1% of the reads is excluded from the plot, but this can be changed with min_barcode_percent when initialising pycoQC.
As with the previously described methods, colors are customisable with the colors argument.
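The low frequency barcode clean-up can be sketched with collections.Counter. Following the description of min_barcode_percent, barcodes below the threshold are relabelled as unclassified. The function name relabel_rare_barcodes is hypothetical, and keeping barcodes that sit exactly at the threshold is an assumption.

```python
from collections import Counter


def relabel_rare_barcodes(barcodes, min_barcode_percent=0.1):
    """Set barcodes seen in too few reads to 'unclassified'."""
    counts = Counter(barcodes)
    total = len(barcodes)
    # Keep barcodes whose share of all reads reaches the threshold (in percent)
    keep = {b for b, n in counts.items() if 100 * n / total >= min_barcode_percent}
    return [b if b in keep else "unclassified" for b in barcodes]


# 2000 reads, one of which carries a spurious barcode (0.05% of the reads)
barcodes = ["BC01"] * 1500 + ["BC02"] * 499 + ["BC07"] * 1
cleaned = Counter(relabel_rare_barcodes(barcodes))
print(cleaned)
```

Raising the threshold removes more spurious labels at the cost of possibly hiding genuinely rare samples.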
jhelp(pycoQC.barcode_counts)
pycoQC.pycoQC.barcode_counts
Plot a mean quality over time
- colors : list of str (default = ['#f8bc9c', '#f6e9a1', '#f5f8f2', '#92d9f5', '#4f97ba'])
List of colors (hex, rgb, rgba, hsl, hsv or any CSV named colors https://www.w3.org/TR/css-color-3/#svg-color
- width : int (default = None)
With of the ploting area in pixel
- height : int (default = 500)
height of the ploting area in pixel
- plot_title : str (default = Percentage of reads per barcode)
Title to display on top of the plot
Albacore output example
# Run cell with Ctrl + Enter
p = pycoQC ("./data/summary/Albacore-1.7.0_basecall-1D-DNA_sequencing_summary.txt.gz")
fig = p.barcode_counts ()
iplot(fig, show_link=False)
Guppy output example
# Run cell with Ctrl + Enter
p = pycoQC (
    seq_summary_file="./data/summary/Guppy-2.1.3_basecall-1D-DNA_sequencing_summary.txt.gz",
    barcode_summary_file="./data/summary/Guppy-2.1.3_basecall-1D_DNA_barcoding_summary.txt.gz")
fig = p.barcode_counts ()
iplot(fig, show_link=False)
Channels activity over time
Although the flowcell layout could be visually attractive (see https://github.com/mattloose/flowcellvis) this is not very informative on how the channels generate data during the run.
The channels_activity method generates a heatmap style plot showing the output over time per channel.
The number of channels can be changed to match MinION flowcells (512 default) or PromethION flowcells (3000).
The argument smooth_sigma can be used to adjust the smoothing factor of the Gaussian smoothing filter.
Colors can be changed with colorscale.
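The data behind such a heatmap is essentially a matrix of output per channel and per time bin. A minimal sketch is given below, assuming channels are numbered from 1 and output is counted in bases. The function name channel_activity and its n_channels parameter are hypothetical names chosen for the illustration.

```python
def channel_activity(reads, n_channels=512, time_bins=150):
    """Build a time_bins x n_channels matrix of sequenced bases.

    reads: list of (channel, start_time, sequence_length) tuples,
    with channels numbered from 1 to n_channels.
    """
    t_max = max(t for _, t, _ in reads)
    width = t_max / time_bins or 1.0  # fall back to 1.0 if all reads start at 0
    matrix = [[0] * n_channels for _ in range(time_bins)]
    for channel, t, length in reads:
        row = min(int(t / width), time_bins - 1)  # time runs along the rows (y axis)
        matrix[row][channel - 1] += length        # channels run along the columns
    return matrix


reads = [(1, 0, 100), (1, 30, 50), (3, 10, 200)]
matrix = channel_activity(reads, n_channels=4, time_bins=3)
for row in matrix:
    print(row)
```

A column that stays at zero after some point in time corresponds to a channel that stopped producing reads.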
jhelp(pycoQC.channels_activity)
pycoQC.pycoQC.channels_activity
Plot a yield over time
- colorscale : list (default = [[0.0, 'rgba(255,255,255,0)'], [0.01, 'rgb(255,255,200)'], [0.25, 'rgb(255,200,0)'], [0.5, 'rgb(200,0,0)'], [0.75, 'rgb(120,0,0)'], [1.0, 'rgb(0,0,0)']])
a valid plotly color scale https://plot.ly/python/colorscales/ (Not recommanded to change)
- smooth_sigma : float (default = 1)
sigma parameter for the Gaussian filter line smoothing
- time_bins (default = 150)
Number of bins to divide the time values in (y axis)
- sample : int (default = 100000)
If given, a n number of reads will be randomly selected instead of the entire dataset
- width : int (default = None)
With of the ploting area in pixel
- height : int (default = 600)
height of the ploting area in pixel
- plot_title : str (default = Output per channel over experiment time)
Title to display on top of the plot
# Run cell with Ctrl + Enter
p = pycoQC ("./data/summary/Albacore-2.1.10_basecall-1D-DNA_sequencing_summary.txt.gz")
fig = p.channels_activity ()
iplot(fig, show_link=False)
Generate a sequencing summary file from fast5 files
pycoQC comes with a small utility tool to generate a sequencing summary file when it is not available (say your genomic facility doesn't keep it).
The program can also attempt to extract additional information including the file path (include_path) corresponding to each read and the following fields:
* mean_qscore_template
* sequence_length_template
* called_events
* skip_prob
* stay_prob
* step_prob
* strand_score
* read_id
* start_time
* duration
* start_mux
* read_number
* channel
* channel_digitisation
* channel_offset
* channel_range
* channel_sampling
* run_id
* sample_id
* device_id
* protocol_run
* flow_cell
* calibration_strand
* calibration_strand
* calibration_strand
* calibration_strand
* barcode_arrangement
* barcode_full
* barcode_score
If a field is not found or is invalid, it is simply ignored for the current fast5 file.
Multiprocessing is supported to speed up the data extraction (threads).
If generated with the minimal default fields, the file is compatible with pycoQC.
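Before passing a generated file to pycoQC, it can be handy to check that the expected columns are present. The sketch below assumes that the minimal fields are the six columns reported as "Columns found" for a 1D run earlier in this notebook. The helper missing_columns is hypothetical and not part of pycoQC.

```python
import csv
import io

# Assumed minimal columns, taken from the "Columns found" log line for a 1D run
REQUIRED = [
    "read_id", "run_id", "channel", "start_time",
    "sequence_length_template", "mean_qscore_template",
]


def missing_columns(handle, required=REQUIRED):
    """Return the required column names absent from a tab separated summary."""
    header = next(csv.reader(handle, delimiter="\t"))
    return [c for c in required if c not in header]


# In-memory stand-in for a summary file lacking two of the columns
demo = io.StringIO(
    "read_id\trun_id\tchannel\tmean_qscore_template\n"
    "r1\tabc\t12\t8.5\n"
)
missing = missing_columns(demo)
print(missing)
```

For a real file the handle would come from open(path) or, for compressed summaries, gzip.open(path, "rt").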
# Run cell with Ctrl + Enter
from pycoQC.Fast5_to_seq_summary import Fast5_to_seq_summary
# Run cell with Ctrl + Enter
Fast5_to_seq_summary (
    fast5_dir="./data/fast5/",
    seq_summary_fn="./data/fast5/summary_sequencing.tsv",
    threads=6,
    verbose_level=1,
    fields=["mean_qscore_template", "called_events", "duration"])
!head {"./data/fast5/summary_sequencing.tsv"}
Check input data and options
Start processing fast5 files
22 reads [00:00, 520.92 reads/s]
Overall counts
valid files: 22
fields found
mean_qscore_template: 22
called_events: 22
duration: 22
fields not found
Total reads: 22 / Average speed: 300.45 reads/s
mean_qscore_template called_events duration
7.608 1615 24233
8.206 1649 24747
8.544 3740 56107
8.234 1827 27409
8.325 3846 57697
8.119 3181 47720
8.304 1547 23218
8.219 2080 31208
8.124 2978 44675
# Run cell with Ctrl + Enter
Fast5_to_seq_summary (
    fast5_dir="./data/fast5/",
    seq_summary_fn="./data/fast5/summary_sequencing.tsv",
    threads=6,
    verbose_level=1,
    include_path=True)
!head {"./data/fast5/summary_sequencing.tsv"}
Check input data and options
Start processing fast5 files
22 reads [00:00, 719.82 reads/s]
Overall counts
valid files: 22
fields found
read_id: 22
run_id: 22
channel: 22
start_time: 22
sequence_length_template: 22
mean_qscore_template: 22
calibration_strand_genome_template: 22
fields not found
barcode_arrangement: 22
Total reads: 22 / Average speed: 313.08 reads/s
read_id run_id channel start_time sequence_length_template mean_qscore_template calibration_strand_genome_template path
2c32553e-62c6-4c7a-bf05-249771364f04 40ebe55356ada6c830fa793745ef4c498d896c73 237 11 1151 8.544 filtered_out /home/aleg/Programming/pycoQC/docs/demo/data/fast5/20180625_FAH77625_MN23126_sequencing_run_S1_57529_read_10_ch_237_strand.fast5
e6a8e4d0-7b3c-471a-be26-fa7857d12663 40ebe55356ada6c830fa793745ef4c498d896c73 318 15 392 8.304 filtered_out /home/aleg/Programming/pycoQC/docs/demo/data/fast5/20180625_FAH77625_MN23126_sequencing_run_S1_57529_read_10_ch_318_strand.fast5
f8325de9-a77e-4616-a4a8-69ecf32e1688 40ebe55356ada6c830fa793745ef4c498d896c73 354 16 568 8.206 filtered_out /home/aleg/Programming/pycoQC/docs/demo/data/fast5/20180625_FAH77625_MN23126_sequencing_run_S1_57529_read_10_ch_354_strand.fast5
6af04302-04c8-4d8d-8e87-aa69178b3f24 40ebe55356ada6c830fa793745ef4c498d896c73 36 26 832 8.234 filtered_out /home/aleg/Programming/pycoQC/docs/demo/data/fast5/20180625_FAH77625_MN23126_sequencing_run_S1_57529_read_10_ch_36_strand.fast5
68804104-71dc-465c-b82d-3a99a4689701 40ebe55356ada6c830fa793745ef4c498d896c73 38 20 1010 8.325 filtered_out /home/aleg/Programming/pycoQC/docs/demo/data/fast5/20180625_FAH77625_MN23126_sequencing_run_S1_57529_read_10_ch_38_strand.fast5
37dfa1d5-5d84-486c-bf47-9eb6438f5645 40ebe55356ada6c830fa793745ef4c498d896c73 410 30 555 8.219 filtered_out /home/aleg/Programming/pycoQC/docs/demo/data/fast5/20180625_FAH77625_MN23126_sequencing_run_S1_57529_read_10_ch_410_strand.fast5
5b7fadd0-c646-4c7b-9800-66ee658a5ca8 40ebe55356ada6c830fa793745ef4c498d896c73 150 37 468 7.608 filtered_out /home/aleg/Programming/pycoQC/docs/demo/data/fast5/20180625_FAH77625_MN23126_sequencing_run_S1_57529_read_10_ch_150_strand.fast5
9a1c5296-2ab1-4abd-8d50-e059754cf332 40ebe55356ada6c830fa793745ef4c498d896c73 319 33 1235 8.119 filtered_out /home/aleg/Programming/pycoQC/docs/demo/data/fast5/20180625_FAH77625_MN23126_sequencing_run_S1_57529_read_10_ch_319_strand.fast5
3784283c-47cc-48ac-8d7b-7efd32123b56 40ebe55356ada6c830fa793745ef4c498d896c73 243 20 893 8.54 filtered_out /home/aleg/Programming/pycoQC/docs/demo/data/fast5/20180625_FAH77625_MN23126_sequencing_run_S1_57529_read_10_ch_243_strand.fast5